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Abstract: Gaussian Graphical Models provide a convenient framework 
for representing dependencies between variables. Recently, this tool has re- 
^ ceived a high interest for the discovery of biological networks. The literature 

- focuses on the case where a single network is inferred from a set of mea- 

^ surements, but, as wet lab data is typically scarce, several assays, where 

V N the experimental conditions affect interactions, are usually merged to infer 

a single network. In this paper, we propose two approaches for estimat- 
^ ing multiple related graphs, by rendering the closeness assumption into an 

^^i^ empirical prior or group penalties. We provide quantitative results demon- 

strating the benefits of the proposed approaches. The methods presented 
in this paper are embeded in the R package simone from version 1.0-0 and 
CN later. 
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1. Motivations 

Systems biology provides a large amount of data sets that aim to understand 
the complex relationships existing between the molecular entities that drive any 
biological process. Depending on the molecule of interest, various networks can 
be inferred, e.g., gene-to-gene regulation networks or protein-protein interaction 
networks. The basic idea is to consider that if two molecules interact, a statistical 
dependency between their expression should be observed. 

A convenient model of multivariate dependence patterns is Gaussian Graph- 
ical Modeling (GGM). In this framework, a multidimensional Gaussian variable 
is characterized by the so-called concentration matrix, where conditional in- 
dependence between pairs of variables is characterized by a zero entry. This 
matrix may be represented by an undirected graph, where each vertex repre- 
sents a variable, and an edge connects two vertices if the corresponding pair of 
random variables are dependent, conditional on the remaining variables. 

Merging different experimental conditions from wetlab data is a common 
practice in GGM-based inference methods (Toh and Horimoto 2002, Schafer and 
Strimmer 2005). This process enlarges the number of observations available for 
inferring interactions. However, GGMs assume that the observed data form an 
independent and identically distributed (i.i.d.) sample. In the aforementioned 
paradigm, assuming that the merged data is drawn from a single Gaussian 
component is obviously wrong, and is likely to have detrimental side effects in 
the estimation process. 

In this paper, we propose to remedy this problem by estimating multiple 
GGMs, each of whom matching different modalities of the same set of vari- 
ables, which correspond here to the different experimental conditions. As the 
distributions of these modes have strong commonalities, we propose to estimate 
these graphs jointly. Considering several tasks at a time has been attracting 
much attention in the machine learning literature, where the generic problem is 
usually referred to as "multi-task learning" ( ^ruana 1997). rjirun (200P) used 
the terms "indirect evidence" and "learning from the experience of others" for 
similar ideas. The principle is to learn an inductive bias, whose role is to sta- 
bilize the estimation process, hopefully enabling more accurate predictions in 
small sample size regimes (Baxter 2000). The techniques comprise the empirical 
and hierarchical Bayes methodologies and regularization schemes (see irgyriou 
ct ai. 200'^, for example), which may be interpreted as approximations to the 
latter. Here, we will mainly follow the penalty-based approach to leverage the 
inference of several related graphs towards a common pattern. 

A typical example of this problem arises when inferring gene interactions 
from data measured on slightly different stem cells, such as wild and mutant. It 
is reasonable to assume that most, though not all, interactions will be common 
to both types of cells. The line of attack we propose alleviates the difficulties 
arising from the scarcity of data in each experimental condition by coupling the 
estimation problems. Our first proposal biases the estimation of the concentra- 
tion matrices towards a common value. Our second proposition focuses on the 
similarities in the sparsity pattern that are more directly related to the graph 
itself. We propose the Cooperative-LASSO, which builds on the Group-LASSO 
(Yuan ana Lin 2006) to favor solutions with a common sparsity pattern, but 
encodes a further preference towards solutions with similar sign patterns, thus 
preserving the type of co-regulation (positive or negative) across assays. 
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To our knowledge, the present work is the first to exploit the multi-task learn- 
ing framework for learning GGMs. However, coupling the estimation of several 
networks has recently been investigated for Markov random fields, in the context 
of time-varying networks. Kolar et al. (2009) propose two specific constraints, 
one for smooth variations over time, the other one for abrupt changes. Their 
penalties are closer to the Fused-LASSO and total variations penalties than to 
the group penalties proposed here. 

2. Network Inference with GGM 

In the GGM framework, we aim to infer the graph of conditional depen- 
dencies among the p variables of a vector X from independent observations 
(X^, . . . , X^). We assume that X is a p-dimensional Gaussian random variable 
X A/'(0p,5]). Let K = X)"^ be the concentration matrix of the model; the 
non-zero entries of Kij indicate a conditional dependency between the variables 
Xi and Xj, and thus define the graph Q of conditional dependencies of X. 

The GGM approach produces the graph Q from an inferred K. The latter 
cannot be obtained by maximum likelihood estimation that would typically re- 
turn a full matrix, and hence a useless fully connected graph. To produce sparse 
networks, Banerjee et al. (2008) propose to penalize the entries of K by an ii- 
norm. Friedman et al. (2008) latter addressed the very same problem with an 
elegant algorithm named the graphical-LASSO. Their well- motivated approach 
produces a sparse, symmetric and positive-definite estimate of the concentration 
matrix. However, a cruder though more direct approach has been reported to be 
more accurate in terms of edge detection (V iiiers et al. 2008, Koclia ct ai. zuu5). 
This approach, proposed by Meinshausen and Biihlmann (2006) and known as 
neighborhood selection^ determines Q via an iterative estimation of the neigh- 
borhood of its nodes. For this purpose, it considers p independent ^i-penalized 
regression problems. Let X be the n x p matrix of stacked observations, whose 
kth row contains (X^)""". The vertices adjacent to vertex i are estimated by the 
non-zero elements of /3 solving 

min i||X,-X\,/3||5 + A||/3||i , (1) 

where is the ith column of X and X\^ is X deprived of its ith column: the 
ith variable is "explained" by the remaining ones. As the neighborhood of the 
p variables are selected separately, a post-symmetrization must be applied to 
manage inconsistencies between edge selections; Meinshausen and Biihlmann 
suggest AND or OR rules, which are both asymptotically consistent (as n goes 
to infinity). 

Solving the p regression problems (1) may be interpreted as inferring the 
concentration matrix in a penalized, pseudo maximum likelihood framework, 
where the joint distribution of X is approximated by the product of the p 
distributions of each variable conditional on the other ones (Rocha et al. 2008, 
Ambroise et al. 2009, Ravikumar et al. 2010), that is 

p / n 

£(K|X) = ^ K]logP(Xf IX^'.jK,) 

i = l \fe=l 
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where X!^. is the kth reahzation of the vector X deprived of the ith coordinate. 
Considering the Gaussian assumption on the generation of the data X, the 
pseudo-log-hkehhood admits a compact and simple expression (see derivation 
in Appendix A.l): 



£(K|X) = - logdet(D) - -Tr (^D-^KSKD-s j 



Ylog(27r) , (2) 



where S = n~^X'''X is the empirical covariance matrix, and D is a pxp diagonal 
matrix with elements Da = Ka. In the sequel, it will be convenient to use the 
sufficiency of S for K, and, by a slight abuse of notations, write £(K|X) = 
£(K|S). 

Following Banerjee et al. (2008), an penalty may be added to obtain a 
sparse estimate of K. Nevertheless, our approach to maximizing the pseudo-log- 
likelihood is much simpler than the optimization of the log-likelihood proposed 
by Banerjee et al. (2008). Indeed, as stated formally in the following proposition, 
maximizing the penalized pseudo-log-likelihood on the set of arbitrary matrices 
(not constrained to be either symmetric or positive definite) boils down to solv- 
ing p independent LASSO problems of size p—1. Furthermore, for the purpose of 
discovering the graph structure, additional computational savings are achieved 
by remarking that D needs not to be estimated. We thus avoid the iterative 
scheme of Rocha et al. (2008) alternating optimization with respect to D and 
to the off-diagonal elements of D~^K. 

Proposition 1. Consider the following reordering of the rows and columns of 
K and S: 

(3) 

where K\^\^ is matrix K deprived of its zth column and its zth line, and where 
K^\^ is the ith column of K deprived of its ith element. The problem 

max £(K|S)-A||K||i , (4) 

where ||K||i is the componentwise ^i-norm, can be solved column- wisely by 
considering p LASSO problems in form 
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1 

mm - 

f3eRp-^ 2 



where the optimal /3 is the maximizer of (4) with respect to K^^'Ki^i as defined 
in (3). Hence, Problem (4) may be decomposed into the p Problems (5) of size 
p — 1 generated by the p possible permutations in (3). 

The full solution to Problem (4), in {Kij : i ^ j}, also requires Ka (see Rocha 
et al. 2008). However, since our interest is to unveil the graph structure, the 
sparsity pattern of the penalized maximum likelihood estimate of K is sufficient, 
and the latter is directly recovered from /3. 

Proof. See appendix A. 2. □ 
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From the definition of the covariance matrix S, it is clear that Problem (1) 
is a slight reparameterization of Problem (5). Hence, the graph produced by 
the approach of vienishausen and Buhimann (2006) is identical to the one ob- 
tained by maximizing the penalized pseudo likelihood (Ambroise et al. 2009, 
Proposition 8). 

3. Inferring Multiple GGMs 

In transcriptomics, it is a common practice to conduct several assays where 
the experimental conditions differ, resulting in T samples measuring the expres- 
sion of the same molecules. From a statistical viewpoint, we have T samples 
belonging to different sub-populations, hence with different distributions. As- 
suming that each sample was drawn independently from a Gaussian distribution 
X(^) - N{%, X;^^)), the T samples may be processed separately by following the 
approach described in Section 2. The objective function is expressed compactly 
as a sum: 



Note that it is sensible to apply the same penalty parameter A for all sam- 
ples provided that the T samples have similar sizes and originate from similar 
distributions, in particular regarding scaling and sparseness. 

Problem (6) ignores the relationships between regulation networks. When the 
tasks are known to have strong commonalities, the multi-task learning frame- 
work is well adapted, especially for small sample sizes, where sharing informa- 
tion may considerably improve estimation accuracy. To couple the estimation 
problems, we have to break the separability in K^^\ . . . ,K^-^^ in Problem (6). 
This may be achieved either modifying the data-fitting term or the penalizer. 
These two options result respectively in the graphical Intertwined-LASSO and 
the graphical Cooperative-LASSO presented below. 

3.1. Intertwined Estimation 

In the Maximum A Posteriori framework, the estimation of a concentration 
matrix can be biased towards a specific value, say Sq ^. From a practical view- 
point, this is usually done by considering a conjugate prior on K , that is, a 
Wishart distribution W(So ^,^). The MAP estimate is then computed as if we 
had observed additional observations of empirical covariance matrix Sq. 

Here, we would like to bias each estimation problem towards the same con- 
centration matrix, whose value is unknown. An empirical Bayes solution would 
be to set Sq = S, where S is the weighted average of the T empirical covariance 
matrices. As in the maximum likelihood framework, this approach would lead 
to a full concentration matrix. Hence, we will consider here a penalized crite- 
rion, which does not exactly fit the penalized maximum likelihood nor the MAP 
frameworks, but that will perform the desired coupling between the estimates 
of K'^^^ . . . , K*^^^ while pursuing the original sparseness goal. 

Formally, let ni, . . . , be the sizes of the respective samples, whose empir- 
ical covariance matrices are denoted by . . . , S*^^^. Also denote n = "^Ut, 
we consider the following problem: 



T 




max 




(6) 



t = l ^=1 
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max ^(/:(K(^)|S(^))-A||K(^)||i) , (7) 

where S*^^^ = aS'^^^ + (l— a)S and S = Y^J=i ^tS*^^^. As this criterion amounts 
to consider that we observed a blend of the actual data for task t and data from 
the other tasks, we will refer to this approach as intertwined estimation. The 
idea is reminiscent of the compromise between linear discriminant analysis and 
its quadratic counterpart performed by the regularized discriminant analysis 
of Friedman (1989). Although the tools are similar, the primary goals differ: 
Friedman (1989) aims at getting a control on the number of effective parameters, 
we want to bias empirical distributions towards a common model. 

In order to avoid multiple hyper-parameter tuning, the results shown in the 
experimental section were obtained with a arbitrarily set to 1/2. More refined 
strategies are left for future work. 



3.2. Graphical Cooperative-LAS SO 

The second approach consists in devising penalties that encourage similar spar- 
sity patterns across tasks, such as the Group-LASSO (Yuan and Lin 2006), 
which has already inspired some multi-task learning strategies (Argyriou et al. 
2008), but was never considered for learning graph models. We shortly describe 
how Group-LASSO may be used for inferring multiple graphs before introduc- 
ing a slightly more complex penalty that was inspired by the application to 
biological interactions, but should be relevant in many other applications. 

As in the single task case, sparsity of the concentration matrices is obtained 
via an ii penalization of their entries. An additional constraint imposes the sim- 
ilarity between the two concentration matrices. Each interaction is considered 
as a group. 

The Group-LASSO is a mixed norm that encourages sparse solutions with 
respect to groups, where groups form a pre-defined partition of variables. In the 
GGM framework, by grouping the partial correlations between variables across 
the T tasks, such a penalty will favor graphs ^i, . . . , with common edges. 
The learning problem is then 

max ^£(K(*)|S«) -X^ft {kS^YY'- 

t=l i^j ^ t=l ' 



Though this formalization expresses some of our expectations regarding the 
commonalities between tasks, it is not really satisfying here since we aim at in- 
ferring the support of the solution (that is, the set of non-zero entries of K*^^^). 
To enable the inference of different networks (t^u)^ we must have some (i,j) 
such that K^^ = and K^^^ 7^ 0. This event occurs with probability zero with 
the Group-LASSO, whose variables enter or leave the support group- wise (Yuan 
cMiu Lin 2006). However, we may cure this problem by considering a regular- 
ization term that better suits our needs. Namely, when the graphs represent 
the regulation networks of the same set of molecules across experimental con- 
ditions, we expect a stronger similarity pattern than the one expressed in (8). 
Specifically, the co-regulation encompasses up-regulation and down-regulation 
and the type of regulation is not likely to be inverted across assays: in terms of 
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partial correlations, sign swaps are very unlikely. This additional constraint is 
formalized in the following learning problem: 

T 

max y £(K(*)|S(*)) 

-ae((e«)'-;(eh;o: 

where [u]^ = max(0,ii). 

Figures 1 and 2 illustrate the role of each penalty on a problem with T = 2 
tasks and p = 2 variables. They represent several views of the unit balls 




i=l ^ t=l 



1/2 ^ ^ ^ 2 \ 

^ ' <1 , 



E(E(ft't) +(EM"): 



=1 ^t=l 

that is, the admissible set for a penalty for a problem with two tasks and two 
features. 

These plots also provide some insight on the sparsity pattern that originate 
from the penalty, since sparsity is related to the singularities at the boundary 
of the admissible set (Nikolova 2000). In particular, the first column illustrates 
that, when /3^^ is null, /32^^ may also be exactly zero, while the second column 

(2) 

shows that this event is improbable when differs from zero. The second row 
illustrates the same type of relationship between /3[^^ and /3[^^ that are expected 
due to the symmetries of the unit ball. 

Figure 2 corresponds to a Cooperative-LASSO penalty. These plots should be 
compared with their Group-LASSO counterpart in Figure 1. We see that there 
are additional discontinuities in the unit ball resulting in new vertices on the 3-D 
plots. As before, we have that, when /3^^ is null, /32^^ may also be exactly zero, 
but in addition, we may also have or ^l^"^ exactly null. Accordingly, in the 
second and third row, we see that we may have /32^^ null when /3^^ is non-zero. 
These new edges will result in some new zeroes when the Group-LASSO would 
have allowed a solution with opposite signs between tasks. 

The second main striking difference with Group-LASSO is the loss of the axial 
symmetry of the Cooperative-LASSO when some variables are non-zero. These 
plots illustrate that the decoupling of the positive and negative parts of the 
regression coefficients in the penalty favors solutions where these coefficients are 
of same sign across tasks. The penalties are identical in the positive and negative 
orthant, but the Cooperative-LASSO penalization is more stringent elsewhere, 
when there are some sign mismatches between tasks. The most extreme situation 
occurs when there is no sign agreement across tasks for all variables. In the setup 
represented here, with only two tasks, the effective penalty then reduces to the 
LASSO. 
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Fig 1. Admissible set for the Group-LASSO penalty for a problem with two tasks and two 
features. Top row: cuts of the unit ball through {^[^\ ^2^"^ ) for various values of (3^\ 
where span the horizontal plane, and (32^^^ is on the vertical axis; bottom rows: 

cuts through {(3[^\ 132^'^) for various values of and 13^^). 
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Fig 2. Admissible set for the Cooperative-LAS SO penalty for a problem with two tasks and 
two features. Top row: cuts of the unit ball through {/3[^\ /^2^'*) for various values of /3^\ 
where {(3[^\l3[^^) span the horizontal plane, and /S!^ is on the vertical axis; bottom rows: 
cuts through ( /9 J \ for various values of {^[^'^ and (3^^). 



Chiquet, Grandvalet and Ambroise/ Inferring Multiple Graph Structures 



10 



4. Algorithms 

In this section, we describe the strategy proposed for solving the three opti- 
mization problems introduced above, based upon the proposal of Osborne et al. 
(2000a) for solving the LASSO. This part also draws its inspiration from Os- 
borne et al. (2000b), Kim et al. (2006), Roth and Fischer (2008). 



4-1' Problem Decomposition 



The multiple independent tasks Problem (6) can be solved by considering either 
T single tasks like (4) of size (p — 1) x p (each one possibly decomposed in p 
LASSO sub-problems of sizep— 1), or a single large problem of size Tx{p—1) xp, 
which can be decomposed into p LASSO sub-problems of size {p— 1) x T, through 
Proposition 1. This line of attack is not computationally efficient here, but it 
will become advantageous when considering the penalties presented in Section 
3.2. It is introduced at this point to provide a unified conceptual view of all 
algorithms. 

Consider the {p T) x {p T) block-diagonal matrix C composed by the empirical 
covariance matrices of each tasks 



C = 



\ 



and define 







\ 


















[ 









(10) 



The (jp— 1) T X (jp— 1) T matrix C^^^^ is the matrix C where we removed each line 
and each column pertaining to variable i. We define C, C>\i\i and C>i\i similarly, 
with S^*^ being replaced by S^^^ for each t = 1, . . . , T in the above definitions. 

Let (3^^^ G R^^""*^^ denote the vector estimating K^^^^, defined from K^^^ as 
in (3), and let (3 G R^><(^^~^) be the vector of the concatenated estimates 
= (/3^^^^, • • • ,/3^"^^^). The optimization of (6) is achieved by solving p sub- 
problems in form: 



mm - 

/3^IRTx(p-l) 2 



^\i\if^ + ^\i\i'^i\i 



^jz^rmi ' (11) 

t=l 



Note that we do not need to perform the costly matrix operations that are 
expressed in the the first term of the objective function of Problem (11). In 
practice, we compute 

/(/3;C) = Vc\iv/3 + /3'Ci\i , 



which only differs from the squared £2 norm in (11) by a constant that is irrel- 
evant for the optimization process. 



Chiquet, Grandvalet and Amhroise/ Inferring Multiple Graph Structures 



11 



Accordingly, Problems (7), (8) and (9) can be decomposed into p minimiza- 
tion sub-problems whose objective functions may be decomposed as 

Lfc(/3) = /(/3) + A5fe(/3) , (12) 

where, with a slight abuse of notation, /(/3) is either /(/3; C) for Problem (7) 
or /(/3; C) for Problems (8) and (9), and where gk{l3) stands for the penalty 
functions respectively defined below: 

• for the graphical Intertwined LASSO 




1 



• for the graphical Group-LASSO 

.2(/3)=E|K"'L ' 

i=l 

where P^l'^^ = • • • , /^l^^^ ^ is the vector of the ith component 

across tasks; 

• for the graphical Cooperative-LAS SO 

^3(/3)=E 

i=l ^ 

Since / is convex with respect to /3, and all penalties are norms, all these 
objective functions are convex and thus easily amenable to optimization. They 
are non-differentiable at zero, due to the penalty terms, which all favor zero co- 
efficients. Bearing in mind the typical problems in biological data, where graphs 
have a few tens or hundreds nodes, and where connectivity is very weak^, we 
need convex optimization tools that are efficient for medium-size problems with 
extremely sparse solutions. We thus chose a greedy strategy that aims at solv- 
ing a series of small-size sub-problems, and will offer a simple monitoring of 
convergence. 

4-2. Solving the Sub-Problems 

The minimizers /3 of the objective functions (12) are assumed to have many zero 
coefficients. The approach developed for the LASSO by Osborne et al. (2000a) 
takes advantage of this sparsity by solving a series of small linear systems, whose 
size is incrementally increased/decreased, similarly to a column generation algo- 
rithm. The master problem is the original problem, but solved only with respect 
to the subset of variables currently identified as non-zero /3 coefficients. The sub- 
problem of identifying new non-zero variables simply consists in detecting the 
violations of the first-order optimality conditions with respect to all variables. 
When there are no more such violations, the current solution is optimal. 

-•^ Typically, the expected number of vertices in graphs to scale as the number of nodes, that 
is, we expect order of y^pT non-zero coefficients in each sub-problem of size T x (p — 1). 
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The objective functions Lk{/3) are convex and smooth except at some loca- 
tions with zero coefficients. Thus, the minimizer is such that the null vector 
G W~'^ is an element of the sub differential d(3Lk{l3). In our problems, the 
subdifferential is given by 

d(3Lum = V^/(/3) + \d(,gum , (13) 

where V [3f{(5) = C\i\i/3 + Ci\i and where the form of dfsgki^) differs for the 
three problems and will be detailed below. 

The algorithm is started from a sparse initial guess, that is, /3 = or, if 
available, the solution obtained on a more constrained problem with a larger 
penalization parameter A. Then, one converges to the global solution iteratively, 
by managing the index A of the non-zero coefficients of /3 and solving the master 
problem over where the problem is continuously different iable. The manage- 
ment of A requires two steps: the first one removes from A the coefficients 
that have been zeroed when solving the previous master problem, ensuring its 
differentiability at the next iteration, and the second one examines the candi- 
date non-zero coefficients that could enter A. In this process, summarized in 
Algorithm 1, the size of the bigger master problems is typically of the order of 
magnitude of the number of non-zero entries in the solution. Solving the mas- 
ter problem with respect to the non-zero coefficients can be formalized as 
solving minhi^/c(/3^ + h), where h G RI-^I is optimal if G d^Lki/S^ + h). 



Algorithm 1: General optimization algorithm 

// 0. INITIALIZATION 
/3 ^ 

A^d) 

while ^ a^L(/3) do 

// 1. MASTER PROBLEM: OPTIMIZATION WITH RESPECT TO 

Find a (approximate) solution h to the smooth problem 

Vh/(/3^ + h) + Aah^fc(/3^ + h) = . 
// where di^gk = {Vhfi'fc} 

// 2. IDENTIFY NEWLY ZEROED VARIABLES 

while 3i e A: 6i=0 and min I + A^l = do 

^ A^A\{i} 
// 3. IDENTIFY NEW NON-ZERO VARIABLES 

// Select i G A^ such that an infinitesimal change of /Si provides the 
highest reduction of 

i ^ arg max-Uo, where Vn = min ^l^P^ + 

if i;^ / then 

I A^A\J{i} 
else 

|_ Stop and return /3, which is optimal 
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Implementation Details 

We provide below the implementation details that are specific to each optimiza- 
tion problem. Specificity of each problem relies on djsgki/S), denoted 6 herein. 

Intertwined LASSO — This LASSO problem is solved as proposed by Os- 
borne et al. (2000a), except that we consider here the Lagrangian formulation 
with A fixed. 

The components of in the subdifferential (13) read 

if ft = then Oi e [-1, 1] , else Oi = sign(ft) . 

Solving the master problem on A requires an estimate of 6^^ at + h . It is 
computed based on a local approximation, where the components of sign(/3^+h) 
are replaced by sign(/3^). ^ This leads to the following descent direction h: 



h = -/3^ 



C\-i.(A^)(c,v(^) 



where, in order to avoid double subscripts, we use the notation M(^, ^) for 
the square submatrix of M formed by the rows and columns indexed by A, and 
v(^) for the subvector formed by the columns of v indexed by A. 

Then, before updating /3^, one checks whether the local approximation used 
to compute h is consistent with the sign of the new solution. If not the case, 
one looks for the largest step size p in direction h such that = + ph 
is sign-consistent with /3^. This amounts to zero a coefficient, say ft, and i is 
removed from A if |9/(/3^)/9ft| < A, otherwise, the corresponding Oi is set to 
— sign(9/(/3^)/9ft) . In any case, a new direction h is computed as above, and 
is updated until the optimality conditions are reached within A. 

Finally, the global optimum is attained if the first-order optimality conditions 
are met for all the components of /3, that is, if /3 verifies 

G C\,\ J + c,\, + xe , 



where 6 is such 

e^ = sign(3^) and < 1 . 

Graphical Group-LASSO — In this problem, the subdifferential (13) is con- 
ditioned on the norm of /3- ' , the vector of the ith component across tasks. 



Let e 



■1:T] 



(0^^\. . . , e be deffned similarly to /3f we have that 



if 



/3i 



1:T] 



then 



< 1 



else 6 



l-.T] 



1:T] 



/3l 



1:T] 



where, here and in what follows, 0/0 is deffned by continuation as 0/0 = 0. As 
the subgradient w.r.t. /3- * ^ reduces to a gradient whenever one component of 
l3\ ' Ms non-zero, the management of the null variables is done here by subsets 



^When A is updated and that /Si = 0, the corresponding 6i is set to —sigii{df{l3)/dl3i) . 
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of T variables, according to 



V^|l:T,/(/3) 



, instead of the one by one basis of 



the LASSO. Hence, we only need to index the groups l}in^. 
Here also, solving the master problem on A requires an estimate of ^ at 



/3 



;i:T] 
A ' 
,[1:T] 



- h . Provided that 



/3[ 



[i:r] 



7^ for all i G A, 6 is differentiable w.r.t. 



^ . It will thus be approximated by a first-order Taylor expansion, resulting 
in a Newton- Raphson or quasi- Newton step. Here, we used quasi- Newton with 



BFGS updates. Note that, whenever (3, 



;i:T] 



0, that is, when a new group of 



variables has just been activated or is about to be deactivated, the corresponding 



1:T] 



is set so that 



V^|l:T,/(/3)- 

[1:T] 



\1:T] 



(14) 



is minimum (that is, with 6\ ' ^ proportional to V^[i:t]/). The updates of A are 
also based on the minimal value of (14). 

Graphical Cooperative-LASSO — As the Group-LASSO, the Cooperative- 
LAS SO considers a group structure, but its implementation differs considerably 
from the former in the management of A. Though several variables are usually 
activated or deactivated at the same time, they typically correspond to subsets of 
/3 • ■ , and these subsets are context-dependent; they are not defined beforehand. 
As a result, the index of non-zero /3- ' Ms better handled by considering two 



sets: the index of (3 
and A- 



[1:T\ 



with positive and negative components: 



|i G 1} 
i G 1} 



>0 



> 



} 



Let T denote the index of non-zero entries of /3 • ' , with complement T^; the 
subdifferential at the current solution is such that: 



Mi e A^nA^L, then 



max 



< 1; 



if i G yl^j. n ^_ then 



i e A^ n^i, then 







r 



< 1 and 
< 1 and 



r 



I3j , 



{or). 



0; 
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if i G ^+ n^_, then 



(sign(/3f )/3 



[1:T] 



t = 1, 



Once ^+ and A are determined, the master problem is solved as for the Group- 
LASSO, with BFGS updates, with box constraints to ensure sign feasible j3\ ' ^ 
for i such that i G A\ f^A- or z G Aj^ {~^A^_. When a new variable has just been 



activated or is about to be deactivated, the corresponding 6 



[1:T] 



is set so that 



(15) 



is minimum. The updates of A are also based on the minimal value of (15). 



5. Experiments 

In most real-life applications, the major part of the inferred graphs are unknown, 
with little available information on the presence/absence of edges. We essentially 
face an unsupervised learning problem, where there is no objective criterion 
allowing to compare different solutions. As a result, setting the hyper- parameters 
is particularly troublesome, alike, say, choosing the number of components in 
a mixture model, and it is a common practice to visualize several networks 
corresponding to a series of penalties. 

Regarding the first issue, we chose to present here synthetic and well-known 
real data that allow for an objective quantitative evaluation. Regarding the 
second issue, the problem of choosing penalty parameters can be guided by the- 
oretical results that provide a bound on the rate of false edge discovery (Mein- 
shausen and Biihlmann 2006, Banerjee et ai. 2008, Ambroise et al. 2009), or by 
more traditional information criteria targeting the estimation of K (Yuan and 
Lin 2007, Rocha et al. 2008). However, these proposals tend to behave poorly, 
and it is a usual practice to compare the performance of learning algorithms 
by providing a series of results, such as precision-recall plots or ROC-curves, 
letting the choice of penalty parameters as a mostly open question for future 
research. Although the shortcomings of this type of comparison are well-known 
(Drummond and Holte 2006, Bengio et al. 2005), we will use precision vs. recall 
plots, which can be considered as valuable exploratory tools. 

Precision is the ratio of the number of true selected edges to the total number 
of selected edges in the inferred graphs. Recall is the ratio of the number of true 
selected edges in the inferred graphs to the total number of edges in the true 
graphs. In a statistical framework, the recall is equivalent to the power and the 
precision is equivalent to one minus the false discovery proportion. 



5.1. Synthetic Data 
5.1.1. Simulation Protocol 

To generate T samples stemming from a similar graph, we first draw an "ances- 
tor" graph with p nodes and k edges according to the Erdos-Renyi model. Here, 
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we consider a simple setting with T = 4 and a network with p = 20 nodes and 
k = 20 edges, as ihustrated in Figure 3. Then, T children graphs are produced 
by random addition and deletion of S edges in the ancestor graph. The T con- 
centration matrices are built from the normalized graph Laplacians, whose off- 
diagonal elements are slightly deflated to produce strictly diagonally dominant 
matrices. To allow for positively and negatively correlated variables, we generate 
a strictly triangular matrix of random signs drawn from a Rademacher distribu- 
tion. This matrix is symmetrized, complemented with ones on the diagonal, and 
its component-wise multiplication with the deflated Laplacians produces the 
ground-truth for the concentration matrices K^^\ . . . , K^-^^. Each K^^^ is finally 
used to generate n Gaussian vectors with zero mean and covariance K*^^^ 

5.1.2. Experimental Setup 

The precision-recall plots are computed by considering the cumulative number 
of true and false edge detections among the T = 4 children networks. Let f 
be the set of edges for children network precision and recall are respectively 
formaly defined as: 

t E i(^i^) E E i(^y^) 

t=i ii,j)esw t=i {i,j)es(t) 

^iid , 

EEK^S^) E|^^*^ 

t=l i>j t=l 

where K^j^ is the estimated partial correlation between variables i and j for 
network t, l{u) returns 1 if u ^ and otherwise, and |f | is the cardinal of S. 

To ensure representativeness, the precision-recall figures are averaged over 
100 random draws of the ancestor graph, the averaging being performed for 
fixed values of the penalization parameter A. That is, each point in the (preci- 
sion, recall) plane is the average of the 100 points obtained for each random draw 
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of the ancestor graph, for a given estimation method and for a given value of 
the penahzation parameter. We compare our proposals, namely the Graphical 
Intertwined {a = 1/2), Cooperative and Group LASSO to two baselines: the 
original neighborhood selection of Meinshausen and Biihlmann (2006), either 
applied separately to each graph (annotated "independent"), or computed on 
the data set merging the data originating from all graphs (annotated "pooled"). 

5.1.3. Results 

Figures 4, 5 and 6 display precision-recall plots for nine prototypical situations. 
From Figure 4 to 6, the sample size increases, and from top to bottom, the differ- 
entiation between networks increases. First, note that the independent strategy 
is not influenced by the level of perturbation, yet only by the sub-sample size, 
as expected. 

The top graph in Figure 4 represents the small-sample low-perturbation 
situation, where merging data sets is a good strategy, leveraging the inde- 
pendent analysis. The latter performs poorly, and our multi-task approaches 
dominate the pooled strategy, the Cooperative-LASSO being superior to the 
Group-LASSO, which has the advantage on the Intertwined LASSO. The 
medium/large-sized-sample low-perturbation (top graph in Figures 5 and 6), 
small/medium-sized-sample medium-perturbation (middle graph in Figures 4 
and 5) and small-sized-sample large-perturbation (bottom graph in Figure 4) 
are qualitatively similar, with less differences between all competitors. For the 
large-sample medium-perturbation (middle graph in Figure 6) and medium- 
sized-sample low-perturbation (bottom graph in Figure 5) cases, all methods 
perform similarly. There is a slight advantage for the multi-task strategies for 
high recalls, and a slight advantage for the independent analysis for low recalls 
(that is, for high penalization, where there is less effective degrees of freedom 
to determine). The bottom graph in Figure 6 represents the large-sample high- 
perturbation situation, where merging data is a bad strategy, since the networks 
differ significantly and there is enough data to estimate each network indepen- 
dently. The independent strategy works best, closely followed by the Intertwined 
LASSO. The Cooperative and Group-LASSO behave equally well for high recalls 
(low penalization parameters), but for highly penalized solutions (low recalls), 
they eventually become slightly worse than the pooled estimation. 

These experiments show that our proposals are valuable, especially in the 
most common situation where data are scarce. Among the baselines, the usual 
pooled sample strategy is good in the small-sample low-perturbation, and the 
opposite independent strategy is better in the large-sample high-perturbation 
case. The intertwined LASSO is very robust, in the sense that it always per- 
forms favorably compared to the best baseline method over the whole spectrum 
of situations. Furthermore, except for the large-sample high-perturbation case, 
the Group-LASSO performs even better, and the Cooperative-LASSO improves 
further the supremacy of the multiple graph inference approach. 

5.2. Protein Signaling Network 

Only a few real data sets come with a reliable and exhaustive ground-truth al- 
lowing quantitative assessments. We make use of a multivariate flow cytometry 
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Fig 4. Precision-recall curves for the Intertwined, Cooperative, Group and the two baseline 
LASSO, for inferring four graphs (each with p = 20 nodes, k = 20 edges and a perturbation 
6 from the ancestor graph) from four samples of size ut =25. 
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Fig 5. Precision-recall curves for the Intertwined, Cooperative, Group and the two baseline 
LASSO, for inferring four graphs (each with p = 2^ nodes, k = 20 edges and a perturbation 
S from the ancestor graph) from four samples of size ut = 50. 
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Fig 6. Precision-recall curves for the Intertwined, Cooperative, Group and the two baseline 
LASSO, for inferring four graphs (each with p = 20 nodes, k = 20 edges and a perturbation 
6 from the ancestor graph) from four samples of size ut = 100. 
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data set pertaining to a well-studied human T-cell signaling pathway (Sachs 
et al. 2005). The latter involves 11 signaling molecules (phospholipids and phos- 
phorylated proteins) and 20 interactions described in the literature. The signal- 
ing network is perturbed by activating or inhibiting the production of a given 
molecule. Fourteen assays have been conducted, aiming to reveal different part 
of the network. Here, we used only four assays (inhibition of PKC, activation of 
PKC, inhibition of AKT, activation of PKA). 

Graphs inferred using only one assay at a time show that each assay really 
focus on different part of the network (see Figure 7). 




Fig 7. Four graphs inferred from single assay. From left to right, top to bottom, we have 
respectively graphs inferred from an assay: inhibiting akt, activating pka, inhibiting pkc, ac- 
tivating pkc. Thick black lines represent true positive and thin red lines are false positive. 




Fig 8. Ground-truth pathway (left) and graph sum of the four graphs estimated by Intertwined 
LASSO using all data (right). Thick black lines represent true positive and thin red lines are 
false positive. 

When considering a strategy based on inference from multiple assays, the 
first false positive inferred by the Intertwined Graphical LASSO occurs when 
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11 true interactions out of 20 are detected (see Figure 8). This edge, between p38 
and Jnk, is in fact due to an indirect connection via unmeasured MAP kinase 
kinases (Daclis et ai. 2uuo), which is a typical problem of confounding arising 
in this context. Considering partial correlations within the subset of available 
variables, the edge is correctly detected, but it is a false positive with respect 
to the biological ground truth. Furthermore, in larger biological networks, the 
absence of edge in the ground truth pathway often merely means that there is yet 
no evidence that the co-regulation exists. As a result, most real data evaluation 
of graph inference methods are based on qualitative subjective assessments by 
experts. 

This caveat being, the various inference algorithms behave here as in the 
synthetic experiments: all inference methods perform about equally well for 
large samples (each assay consists here of about 1000 repeated measurements). 

Figure 9 displays the results obtained for small sample sizes. Here also, the 
precision-recall plots are averaged over 100 independent random draws of sam- 
ples of size nt, that is n = 4n^ observations over the four considered assays. It is 
worth noticing that the large sample size limit is almost obtained for rit = 20. 
As for the synthetic experiments, the averaging is performed for fixed values of 
the penalization parameter A. In this situation, our proposals dominate the best 
baseline strategy, which is pooled estimation. Again, the Intertwined LASSO is 
very robust, but the Group-LASSO, and to a greater extent the Cooperative- 
LASSO perform better in the small-sample-size regime. 

6. Conclusion 

This paper presents the first methods dedicated to the inference of multiple 
graphs in the Gaussian Graphical Model framework. In this setup, the two 
baseline approaches consist in either handling the inference problems separately 
or as a single one by merging the available data sets. Our proposals, motivated 
by bioinformatics applications, were devised to describe the dependencies be- 
tween pairs of variables in analogous operating conditions, such as measurements 
recorded in different assays. This situation occurs routinely with omics data. 

Our approaches are based on the neighborhood selection of Meinshausen and 
Biihlmann (zuub). The first one, the Intertwined Graphical LASSO, relaxes 
the uniqueness constraint that is implicit when the tasks are processed as a 
single one, merely biasing the results towards a common answer. Our second 
approach, the Graphical Cooperative-LASSO, is based on a group-penalty that 
favors similar graphs, with homogeneous dependencies between the same pairs 
of variables. Homogeneity is quantified here by the magnitude and sign of partial 
correlations. The Cooperative-LASSO contrasts the Group-LASSO in being able 
to infer differing graph structures across tasks. Our experimental results show 
that our proposals are valuable and robust, consistently performing at least as 
well as the best of the two baseline solutions. 

The algorithms developed in this paper are made available within the R- 
package simone from version 1.0-0 and later. This package also embeds extension 
of the multi-task framework to time-course data, that is, when transcriptomics 
data are collected by considering the same individual across time. This imple- 
mentation builds on the ^i-penalized VAR(l) model described in Charbonnier 
et al. (2010). 
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Fig 9. Precision-recall curves for the Intertwined, Cooperative, Group and the two baseline 
LASSO, for inferring the graphs on four assays of Sachs^ data from four samples of size 
nt = 7, 10 and 20. 



Chiquet, Grandvalet and Ambroise/ Inferring Multiple Graph Structures 



24 



As future work, we will provide a theoretical analysis of the Cooperative- 
LASSO regarding uniqueness of the solution and selection consistency, or spar- 
sistence, that corresponds here to the asymptotic convergence of the set of de- 
tected edges towards the set of true edges. 
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Appendix A: Proofs 

A.l. Derivation of the pseudo-log-likelihood 

We show here that the pseudo-log-likelihood 

P / n \ 
/:(K|X) = ^K]logP(X^l4,;K0 , (16) 

i=i \k=i J 

associated to a sample of size n drawn independently from the multivariate 
Gaussian vector X ~ A/'(Op, T,) reads 

£(K|X) = |logdet(D) - ^IV (d'^KSKD-^) - ^log(27r) , 

where S = n~^XJ^ is the empirical variance- covariance matrix and D is the 
diagonal matrix such that Da = Ka, for i = 1, . . . 

Proof. Since the joint distribution of is Gaussian, the distributions of 
conditioned on the remaining variables X^. are also Gaussian. Their parameters 

{ji^^cji) are given by 

/^i" = ^]\i^\i\i^\i ' = ~ ^^\^^\A^^*V ' (^'^) 

where is matrix 5] deprived of its ith column and its ith line, is the 

ith column of matrix I] deprived of its ith element. 

As K = X)""*^, reordering the rows and columns of the matrices yields 





X 






'Ip-i 0" 


_ ^]\i . 






1 



where K\^\^ is matrix K deprived of its ith column and its ith line, K^\^ is the 
ith column of matrix K deprived of its ith element, and Ip-i is the identity 
matrix of size p — 1. Two of these blockwise equalities are rewritten as follows: 

E,, = (1 - ^J^^K,\,)/Ku , 

^VV^A* ^ —'^i\i/^ii • 
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Using the above identities in (17), we obtain 

a, = (1 - ^^^,\^/Ku + 5:J\,K,\,/i^,, = l/Ku , 

where = [ji], . . . , li^Y . 

Using these notations and the corresponding blockwise notations for S {Su = 
n~-^XjX^, Si\i = n~^X.J.'Xi and S\i\i = n~-^XJ.X\i), Equation (16) reads 



p p 




= I log det D - f log(27r) - | E ^ (KJSK.) , 

i=l 

where is the ith column of K and D is the diagonal matrix such that Da = 
Kii. Finally, we use that Y]l=i (KjSK^) = Tr(D-iKSKD-i) to conclude 
the proof. □ 



A. 2. Blockwise Optimization of the pseudo-log-likelihood 

Proof of Proposition 1. From (18), we have 



/:(K|S) = -| E (2SJ\,K,\, + -^KJ^^S\,\,K,\, ) + c, (19) 



2 ' 



where c does not depend on Kij with j ^ i. Thus, if we discard the symmetry 
constraint on K, maximizing the pseudo-likelihood (19) with respect to the non- 
diagonal entries of K amounts to solve p independent maximization problems 
with respect to K^\^ , i = 1, . . . ,p. The summands of (19) can be rewritten as 



^ii'^^\i\i'^i\i + ^\i\i ^i\i 



2 



2 



where c' = n/2 i^^^Sj^^S\^\^S^\^ does not depend on Kij with j ^ i. Adding 

an ii penalty term on K^\^ and defining /3 = i^^^^K^\^ leads to the objective 
function of Problem (5), which concludes the proof. □ 

A. 3. Suhdifferential for the Cooperative-LAS SO 

By definition, for a convex function ^, the subdifferential is 

d9\,3, = : V/3, 5(/3) - 5(/3o) > 0^(/3 - /3o)} 
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The function g{l3) = \\{^)_^\\^ + ||(— /^j^H^ has kinks whenever /3 has at least 
one zero component and that it has either no positive or no negative component. 
There are thus three situations where the subdifferential does not reduce to the 
gradient : 

1. ||(/3o)+||2 = Oand \\{-f3o)+\\, ^ 0, 

2. ||(/3o)+||2 7^0and ^{-^,)J^ = 0, 

3. ||(/3o)+||2 = and ||(-/3o)+||2 = 0, i.e. Po = 0- 

For the first situation, denoting A the index of non-zero entries of I3q and 
its complement, the subdifferential is defined as 

{||(-/3o)+||2"'/3o + ^:^^ = 

and , ||(/3^c)+||2 > 9^.(3^.} . (20) 

The set of admissible is explicitly given by 

{e :eA = 0, ||(6l^c)^||2 < 1 and \\{-eAc)+\\^ = . (21) 

Proof. We first show that, for any in the set defined in (21), the inequality in 
definition (20) always holds. Dropping the subscript A^ for readability, we have: 

6^13 = {0)1 (3- {-6)1 f3 
= (0)1/3 

= (<(/3)+-(<(-/3)+ 

(0)I(/3)+<||(0)+IU|(/3)J|,<||(/3)J|, . 



< 



To finish the proof, it is sufficient to exhibit some /3 such that the inequality 
in definition (20) does not hold when ||(^)^||2 > 1 or when ||(— ^)^||2 > 0. 

For ||(^)+||2 > 1, we choose /3 = (6)^, yielding 6^^ = ||(^)+||2, and 

ll(/^)+ll2 = IIW+II2 < II h^^^^ Uf^hh < ^^^'^ ll(-^)+ll2 > 0' 
choose (5 = - (-6/)^, yielding 0^ (5 = ||(-6^)+||2 > 0, and \\{l3)^\\^ = 0, hence 
||(/3)+||2<^"'/3. □ 
The second situation is treated as the first one, yielding 

99\0^ = {\\(f3o)+\\~^ f3o + 0:6^^ = 0, \\(-e^.)^\\^ <1 

and 11(0^0)^11^ = 0} . 
For the last situation, the subdifferential, defined as 

dg\f,^ = {d:W, \m+\\^ + \\{-p)+\\^>e^p] , (22) 

reads 

dg\p^ = {d:ma^(\\{d)4^,\\{-d)4^) <l} , (23) 
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Proof. We first show that, for ah the elements of dg as exphcitly defined in (23), 
the inequality in definition (22) always holds: 

0^(3 = {0)1 (3 -{-0)1 (3 

< (<(/3)+ + (-<(-/3)+ 

< ||WJU|(/3)J|, + ||M)JU|(-/3)+||, 

< max(||(0)J|^,||(-0)J|J(||(/3)J|^+||(-/3)J|^) . 

To finish the proof, it is sufficient to exhibit some /3 such that the inequality 
in definition (22) does not hold for max (^|| (^)^ , || (-^)+ H^) > 1- Without 
loss of generality, we assume ||(^)^||2 > 1, and choose (3 = (^)^, yielding 

e^(3 = \\{e)4l and \\{f3)4^ + ||(-/3) = ||(/3)J|^ = ||(^) < \\{e)4l 

hence \\{^)^\\^ + l|(-/5)+ll2 < ^' ° 
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